Intro

Background

What is Spotify? Spotify is a digital music, podcast, and video streaming service that gives you access to millions of songs and other content from artists all over the world. Today Spotify is one of the biggest digital music and podcast services in the world.

Spotify is also a tech company with very advanced technology. For example, every track uploaded to the platform is analyzed, so we can get audio feature information for each track, and it is easy to access through the Spotify API (this API link). In this case we will use a Spotify dataset collected from the API, available from this source on Kaggle.

We will analyze the popularity of each track and, based on the data, try to find out whether popularity is related to the other features or variables. We will also do a clustering analysis using the K-means method and reduce dimensionality using Principal Component Analysis (PCA).

Dataset

We will use a dataset from Kaggle, which can be downloaded from this source.

Initial Setup and Libraries

# Starting collection for data science
library(tidyverse)
# Processing string
library(glue)
# Processing date data type
library(lubridate)
# Visualization of multivariate analysis results (fviz_* functions)
library(factoextra)
# Multivariate data analysis (PCA)
library(FactoMineR)
# Data visualization
library(ggplot2)
library(viridis)
library(GGally)
library(scales)

Import Data

We import the dataset that we downloaded from Kaggle. This dataset contains the audio features of each track.

tracks <- read_csv("data/SpotifyFeatures.csv")

Observe the structure and preview the imported dataset:

glimpse(tracks)
## Observations: 232,725
## Variables: 18
## $ genre            <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie"…
## $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willi…
## $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par …
## $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", …
## $ popularity       <dbl> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0,…
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900…
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.4…
## $ duration_ms      <dbl> 99373, 137373, 170267, 152427, 82625, 160627, 212293…
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.27…
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.12…
## $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "…
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.10…
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, …
## $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major"…
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.95…
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, …
## $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/…
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.53…

Variable Explanation:
1. genre : Track genre
2. artist_name : Artist name
3. track_name : Title of track
4. track_id : The Spotify ID for the track.
5. popularity : Popularity score (0-100)
6. acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
7. danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
8. duration_ms : The duration of the track in milliseconds.
9. energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
10. instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
11. key : The estimated overall key of the track. In the Spotify API, integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on; if no key was detected, the value is -1. In this dataset the key is stored as a pitch name (e.g. "C#").
12. liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
13. loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.
14. mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. In the Spotify API, major is represented by 1 and minor by 0; in this dataset the values are stored as "Major" and "Minor".
15. speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
16. tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
17. time_signature : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
18. valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
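
The thresholds quoted in the list above can be turned into simple labels. The following is a minimal sketch in base R on made-up values; the `toy` data frame and the label names are assumptions for illustration and are not part of the dataset.

```r
# Toy values, invented for illustration only
toy <- data.frame(
  speechiness      = c(0.05, 0.45, 0.90),
  instrumentalness = c(0.00, 0.70, 0.10),
  liveness         = c(0.10, 0.85, 0.30)
)

# speechiness: below 0.33 music, 0.33 to 0.66 mixed, above 0.66 mostly speech
toy$speech_label <- cut(
  toy$speechiness,
  breaks = c(-Inf, 0.33, 0.66, Inf),
  labels = c("music", "mixed", "speech")
)

# instrumentalness above 0.5 is intended to represent instrumental tracks
toy$is_instrumental <- toy$instrumentalness > 0.5

# liveness above 0.8 gives a strong likelihood that the track is live
toy$is_live <- toy$liveness > 0.8

print(toy)
```

The same expressions could be applied to the corresponding columns of the imported data once it is loaded.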

Data Wrangling

First of all, we check for NA or empty values in each variable. We did not find any NA values in the data.

colSums(is.na(tracks))
##            genre      artist_name       track_name         track_id 
##                0                0                0                0 
##       popularity     acousticness     danceability      duration_ms 
##                0                0                0                0 
##           energy instrumentalness              key         liveness 
##                0                0                0                0 
##         loudness             mode      speechiness            tempo 
##                0                0                0                0 
##   time_signature          valence 
##                0                0

Some variables have the wrong data type, so we need to convert them:

  • genre : to factor
  • key : to factor
  • mode : to factor

tracks <- tracks %>%
  mutate(
    # Remove punctuation from the genre labels before converting to factor
    genre = as.factor(str_replace_all(genre, "[[:punct:]]", "")),
    key = as.factor(key),
    mode = as.factor(mode)
  )
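
The punctuation clean-up applied to genre above can be illustrated on a few labels. This is a small sketch using base R's `gsub()` as a stand-in for `str_replace_all()`; the raw labels are hypothetical, chosen to resemble the cleaned names that appear in the outputs of this document.

```r
# Hypothetical raw labels, invented for illustration
raw_genre <- c("Hip-Hop", "HipHop", "Children's Music")

# gsub() is a base R stand-in for stringr::str_replace_all()
clean_genre <- gsub("[[:punct:]]", "", raw_genre)
print(clean_genre)

# Labels that differed only by punctuation now collapse into a single factor level
genre_levels <- levels(as.factor(clean_genre))
print(genre_levels)
```

Cleaning before the factor conversion is what allows such near-duplicate labels to merge.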

Next, we drop the variables that we think are not related to our case. Here we prioritize variables with a numerical data type. Based on the summary, we will drop track_id, time_signature, and track_name.

summary(tracks)
##              genre        artist_name         track_name       
##  Childrens Music: 14756   Length:232725      Length:232725     
##  Comedy         :  9681   Class :character   Class :character  
##  Soundtrack     :  9646   Mode  :character   Mode  :character  
##  Indie          :  9543                                        
##  Jazz           :  9441                                        
##  Pop            :  9386                                        
##  (Other)        :170272                                        
##    track_id           popularity      acousticness     danceability   
##  Length:232725      Min.   :  0.00   Min.   :0.0000   Min.   :0.0569  
##  Class :character   1st Qu.: 29.00   1st Qu.:0.0376   1st Qu.:0.4350  
##  Mode  :character   Median : 43.00   Median :0.2320   Median :0.5710  
##                     Mean   : 41.13   Mean   :0.3686   Mean   :0.5544  
##                     3rd Qu.: 55.00   3rd Qu.:0.7220   3rd Qu.:0.6920  
##                     Max.   :100.00   Max.   :0.9960   Max.   :0.9890  
##                                                                       
##   duration_ms          energy          instrumentalness         key       
##  Min.   :  15387   Min.   :0.0000203   Min.   :0.0000000   C      :27583  
##  1st Qu.: 182857   1st Qu.:0.3850000   1st Qu.:0.0000000   G      :26390  
##  Median : 220427   Median :0.6050000   Median :0.0000443   D      :24077  
##  Mean   : 235122   Mean   :0.5709577   Mean   :0.1483012   C#     :23201  
##  3rd Qu.: 265768   3rd Qu.:0.7870000   3rd Qu.:0.0358000   A      :22671  
##  Max.   :5552917   Max.   :0.9990000   Max.   :0.9990000   F      :20279  
##                                                            (Other):88524  
##     liveness          loudness          mode         speechiness    
##  Min.   :0.00967   Min.   :-52.457   Major:151744   Min.   :0.0222  
##  1st Qu.:0.09740   1st Qu.:-11.771   Minor: 80981   1st Qu.:0.0367  
##  Median :0.12800   Median : -7.762                  Median :0.0501  
##  Mean   :0.21501   Mean   : -9.570                  Mean   :0.1208  
##  3rd Qu.:0.26400   3rd Qu.: -5.501                  3rd Qu.:0.1050  
##  Max.   :1.00000   Max.   :  3.744                  Max.   :0.9670  
##                                                                     
##      tempo        time_signature        valence      
##  Min.   : 30.38   Length:232725      Min.   :0.0000  
##  1st Qu.: 92.96   Class :character   1st Qu.:0.2370  
##  Median :115.78   Mode  :character   Median :0.4440  
##  Mean   :117.67                      Mean   :0.4549  
##  3rd Qu.:139.05                      3rd Qu.:0.6600  
##  Max.   :242.90                      Max.   :1.0000  
## 
tracks <- tracks %>% select(-c(track_id,time_signature,track_name))

Exploratory Data Analysis

The dataset has a genre variable that we can use to group the data. To keep the focus on the popularity variable, we will select the 5 genres with the highest average popularity. We can interpret these as genres whose tracks are spread widely from low to the highest popularity. We will visualize the data:

genre_popularity <- tracks %>%
  select(popularity, genre) %>%
  group_by(genre) %>%
  summarise(average_popularity = round(mean(popularity)))

ggplot(data = genre_popularity, mapping = aes(x = reorder(genre, average_popularity), y = average_popularity, fill = genre)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    y = "Average popularity",
    x = "Genre"
  )

The top 5 genres with the highest average popularity are Pop, Rap, Rock, HipHop, and Dance. We filter our dataset and keep only these 5 genres.

# Filter
tracks <- tracks %>% filter(genre %in% c("Pop", "Rap", "Rock", "HipHop", "Dance"))

# Total row
NROW(tracks)
## [1] 45886
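
The top genres can also be derived from the data instead of being typed by hand. Below is a minimal base R sketch on invented toy data (the `toy` data frame and its values are assumptions). It also shows that filtering a factor keeps its unused levels, which is why the `str()` output further down still reports 26 genre levels.

```r
# Toy data, invented for illustration; the real data has one row per track
toy <- data.frame(
  genre      = factor(c("Pop", "Pop", "Rap", "Rock", "Jazz", "Opera", "Dance", "Comedy")),
  popularity = c(70, 66, 60, 59, 40, 13, 57, 21)
)

# Mean popularity per genre, highest first
avg_pop <- sort(tapply(toy$popularity, toy$genre, mean), decreasing = TRUE)
top5 <- names(avg_pop)[1:5]
print(top5)

# Keep only the rows that belong to the top genres
toy_top <- toy[toy$genre %in% top5, ]

# Filtering keeps unused factor levels; droplevels() removes them
nlevels(toy_top$genre)              # still 7
nlevels(droplevels(toy_top)$genre)  # 5
```

Dropping unused levels is optional, but it keeps later summaries and plot legends limited to the genres that remain.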

Clustering Opportunity

Before we use K-means for clustering, we can take a simpler approach and group the tracks by a factor variable to compare popularity. Here we visualize boxplots of popularity by mode and by genre.

tracks %>% 
  ggplot(aes(x = mode, y = popularity, fill = mode)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  theme_minimal()

tracks %>% 
  ggplot(aes(x = genre, y = popularity, fill = genre)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  theme_minimal()

From both boxplots, we can see that in general genre and key do not have a significant relation to popularity. Even so, there are differences: when a track's overall key is A#, popularity is slightly higher and more stable than for the other keys.

We also found that the Pop genre has more stable popularity than the other 4 genres. So one suggestion to consider: if a producer wants more popularity on the Spotify platform, they could make Pop tracks with A# as the overall key.
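
One way to make "higher" and "more stable" concrete is to compare the median and the interquartile range (IQR) of popularity per group. The sketch below uses base R on invented toy values (the `toy` data frame is an assumption, not the real data), with a smaller IQR read as more stable popularity.

```r
# Toy data, invented for illustration
toy <- data.frame(
  key        = rep(c("A#", "C", "G"), each = 4),
  popularity = c(60, 62, 64, 66,  40, 55, 65, 80,  50, 58, 62, 70)
)

# Median as the typical popularity, IQR as the spread (smaller = more stable)
med    <- tapply(toy$popularity, toy$key, median)
spread <- tapply(toy$popularity, toy$key, IQR)

stats_by_key <- data.frame(
  key    = names(med),
  median = as.numeric(med),
  iqr    = as.numeric(spread)
)
print(stats_by_key)
```

The same two `tapply()` calls could be run with genre or mode as the grouping variable.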

So far we have only visualized categorical variables against popularity. They show some relation with popularity, but we cannot say which variable significantly increases popularity. So we will visualize the numerical variables to see the correlations between them.

ggcorr(tracks, low = "blue", high = "red")
## Warning in ggcorr(tracks, low = "blue", high = "red"): data in column(s)
## 'genre', 'artist_name', 'key', 'mode' are not numeric and were ignored

The plot shows that popularity does not have a strong correlation with any other numerical variable. However, some variables are strongly correlated with each other, which indicates that this dataset has multicollinearity and might not be suitable for various classification algorithms.
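
To list strongly correlated pairs explicitly rather than reading them off the plot, the correlation matrix can be filtered by a cutoff. This is a base R sketch on simulated data: the columns are named after dataset variables, but the values and the 0.7 cutoff are assumptions made for illustration.

```r
set.seed(1)
n <- 200
energy <- runif(n)
toy <- data.frame(
  energy   = energy,
  loudness = -20 + 15 * energy + rnorm(n, sd = 1),  # built to correlate with energy
  tempo    = rnorm(n, mean = 120, sd = 20)          # simulated independently
)

cors <- cor(toy)

# Report each pair once (upper triangle) where |r| exceeds the chosen cutoff
cutoff <- 0.7  # assumed cutoff, adjust as needed
idx <- which(abs(cors) > cutoff & upper.tri(cors), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)
print(strong_pairs)
```

On the real data, `cor()` would be applied to the numeric columns only, as `ggcorr()` does when it ignores the non-numeric ones.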

To find more interesting and undiscovered patterns in the data, we will cluster the tracks using K-means. We will also apply Principal Component Analysis (PCA) to produce uncorrelated (non-multicollinear) variables while reducing the dimension of the data and retaining as much information as possible. The result of this analysis can be used further for classification with lower computation.
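
The claim that PCA produces uncorrelated variables can be checked on a small example. The sketch below uses base R's `prcomp()` as a stand-in for the `PCA()` function used later, on simulated data (`toy` and its columns are assumptions for illustration).

```r
set.seed(2)
n <- 300
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.3),  # strongly correlated with x1
  x3 = rnorm(n)
)

# prcomp() is base R's PCA; scale. = TRUE standardizes the variables first
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Correlations between the original variables vs. between the component scores
print(round(cor(toy), 2))
score_cor <- cor(pca$x)
print(round(score_cor, 2))  # off-diagonal entries are zero up to rounding

# Proportion of variance carried by each component
print(summary(pca)$importance["Proportion of Variance", ])
```

Because x1 and x2 carry nearly the same information, most of the variance ends up in the first component, which is the dimension-reduction effect described above.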

Data Pre-processing

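Since we will apply K-means and PCA, we first pre-process the data: we check the distribution of popularity, draw a 5% random sample to reduce computation, keep only the numeric variables, and scale them.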
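
Scaling matters for K-means because the method works on distances, so a variable measured in large units can dominate all the others. A minimal base R sketch on invented data (the `toy` values are assumptions) shows the effect:

```r
# Toy data, invented for illustration: one variable in milliseconds, one on a 0-1 scale
set.seed(3)
toy <- data.frame(
  duration_ms  = rnorm(100, mean = 220000, sd = 50000),
  danceability = runif(100)
)

# Share of total variance per variable before scaling:
# duration_ms takes almost all of it
vars_raw <- apply(toy, 2, var)
var_share <- vars_raw / sum(vars_raw)
print(round(var_share, 6))

# After scale(), every variable has mean 0 and standard deviation 1
toy_scaled <- scale(toy)
print(round(apply(toy_scaled, 2, sd), 6))
```

After scaling, each variable contributes on an equal footing to the distances that K-means uses.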

hist(tracks$popularity)

set.seed(100)
tracks_sample <- sample_n(tracks, (nrow(tracks) * 0.05))
NROW(tracks_sample)
## [1] 2294
hist(tracks_sample$popularity)

str(tracks_sample)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2294 obs. of  15 variables:
##  $ genre           : Factor w/ 26 levels "A Capella","Alternative",..: 12 9 18 17 12 18 22 17 22 12 ...
##  $ artist_name     : chr  "Talib Kweli" "Andy Grammer" "Mac Miller" "Simple Plan" ...
##  $ popularity      : num  48 55 50 65 67 82 67 62 52 66 ...
##  $ acousticness    : num  0.263 0.0409 0.109 0.000491 0.039 0.0776 0.00118 0.154 0.00487 0.73 ...
##  $ danceability    : num  0.629 0.621 0.438 0.522 0.673 0.643 0.528 0.699 0.542 0.425 ...
##  $ duration_ms     : num  227373 199113 197852 232067 229507 ...
##  $ energy          : num  0.787 0.827 0.792 0.751 0.758 0.904 0.858 0.668 0.714 0.406 ...
##  $ instrumentalness: num  0 0 0 0.00000222 0 0 0.00000156 0.0000032 0.298 0.00000359 ...
##  $ key             : Factor w/ 12 levels "A","A#","B","C",..: 11 3 9 5 11 4 11 9 6 8 ...
##  $ liveness        : num  0.357 0.0815 0.241 0.158 0.341 0.189 0.282 0.362 0.334 0.107 ...
##  $ loudness        : num  -5.62 -7.31 -6.88 -5.46 -3.63 ...
##  $ mode            : Factor w/ 2 levels "Major","Minor": 1 1 2 1 1 1 1 2 1 2 ...
##  $ speechiness     : num  0.376 0.0454 0.346 0.0435 0.158 0.0739 0.0493 0.0336 0.028 0.176 ...
##  $ tempo           : num  85.7 100 175.9 139.5 136 ...
##  $ valence         : num  0.713 0.65 0.463 0.605 0.542 0.481 0.219 0.314 0.76 0.124 ...
tracks_num <- tracks_sample %>% select(-c(genre,artist_name,key,mode))
str(tracks_num)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2294 obs. of  11 variables:
##  $ popularity      : num  48 55 50 65 67 82 67 62 52 66 ...
##  $ acousticness    : num  0.263 0.0409 0.109 0.000491 0.039 0.0776 0.00118 0.154 0.00487 0.73 ...
##  $ danceability    : num  0.629 0.621 0.438 0.522 0.673 0.643 0.528 0.699 0.542 0.425 ...
##  $ duration_ms     : num  227373 199113 197852 232067 229507 ...
##  $ energy          : num  0.787 0.827 0.792 0.751 0.758 0.904 0.858 0.668 0.714 0.406 ...
##  $ instrumentalness: num  0 0 0 0.00000222 0 0 0.00000156 0.0000032 0.298 0.00000359 ...
##  $ liveness        : num  0.357 0.0815 0.241 0.158 0.341 0.189 0.282 0.362 0.334 0.107 ...
##  $ loudness        : num  -5.62 -7.31 -6.88 -5.46 -3.63 ...
##  $ speechiness     : num  0.376 0.0454 0.346 0.0435 0.158 0.0739 0.0493 0.0336 0.028 0.176 ...
##  $ tempo           : num  85.7 100 175.9 139.5 136 ...
##  $ valence         : num  0.713 0.65 0.463 0.605 0.542 0.481 0.219 0.314 0.76 0.124 ...
# Standardize the variables so that no single feature (e.g. duration_ms) dominates the distances
tracks_scale <- scale(tracks_num)

# Choose the number of clusters on the same scaled data that kmeans() receives below
fviz_nbclust(tracks_scale, kmeans, method = "wss", k.max = 15) +
  labs(subtitle = "Elbow method")

fviz_nbclust(tracks_scale, kmeans, method = "silhouette", k.max = 15)

fviz_nbclust(tracks_scale, kmeans, method = "gap_stat", k.max = 10) + labs(subtitle = "Gap Statistic method")
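
The elbow method behind `fviz_nbclust(..., method = "wss")` can be sketched by hand with base R's `kmeans()`: fit the model for several values of k and compare the total within-cluster sum of squares. The toy data, the range of k, and `nstart = 10` are assumptions for illustration.

```r
set.seed(4)
# Toy data with two well-separated groups, invented for illustration
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which adding clusters stops reducing WSS by much
elbow_table <- data.frame(k = ks, wss = round(wss, 2))
print(elbow_table)
```

In this toy case the drop from k = 1 to k = 2 is large and later drops are small, which is the pattern to look for in the elbow plot.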

set.seed(123)
km_tracks <-kmeans(tracks_scale, centers = 2)
km_tracks
## K-means clustering with 2 clusters of sizes 736, 1558
## 
## Cluster means:
##     popularity acousticness danceability duration_ms     energy
## 1  0.013047091    0.7995148  -0.13930852  0.12854175 -1.0607826
## 2 -0.006163453   -0.3776912   0.06580942 -0.06072319  0.5011143
##   instrumentalness   liveness   loudness speechiness      tempo    valence
## 1       0.19655118 -0.2476437 -0.9122186 -0.07286271 -0.2282781 -0.5674398
## 2      -0.09285088  0.1169870  0.4309325  0.03442038  0.1078387  0.2680588
## 
## Clustering vector:
##    [1] 2 2 2 2 2 2 2 2 2 1 1 2 1 1 2 2 2 2 2 2 2 1 2 1 1 1 2 2 1 1 1 2 2 2 1 2 1
##   [38] 1 1 2 2 1 1 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 2 2 2 2 2 1 2 1 2 2 2 2
##   [75] 2 1 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2 1 2 2 2 1 2 1 2 1 1 1 2 1 1 2 1 2 1 1 1
##  [112] 2 1 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 1 2 1 2 2 1 1 2 1 2
##  [149] 1 2 2 2 1 1 2 1 1 2 2 1 2 1 2 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2
##  [186] 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2
##  [223] 1 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 2 1 1 2 2 2 2 1 2 1 1 1 2 1 2 1
##  [260] 2 1 2 2 2 1 2 1 2 2 2 2 1 2 2 1 1 1 2 1 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 2 2
##  [297] 1 1 2 2 1 1 2 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2 2 2
##  [334] 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1
##  [371] 1 1 2 2 2 2 1 1 2 1 2 2 1 2 2 2 2 2 1 1 1 2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 1
##  [408] 2 2 1 1 2 2 1 2 2 2 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 1 1
##  [445] 1 2 1 2 2 1 2 1 2 1 2 2 2 2 1 1 2 2 2 1 2 1 1 1 2 2 2 1 2 1 1 1 1 2 1 2 2
##  [482] 2 1 1 2 2 2 2 1 1 2 1 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 1 1 1 2 2 1
##  [519] 2 2 1 2 1 2 2 2 2 2 2 1 2 1 1 1 2 1 1 2 1 2 2 2 2 1 2 2 2 1 2 1 2 1 1 2 1
##  [556] 2 1 2 1 2 2 1 2 1 2 1 2 1 2 2 1 2 1 2 1 1 2 2 2 2 2 1 1 2 1 2 2 2 1 2 1 1
##  [593] 1 2 1 2 2 2 1 1 2 2 2 1 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 1
##  [630] 2 1 2 2 2 1 2 1 2 2 2 2 1 1 1 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2
##  [667] 2 2 2 2 1 2 1 2 1 2 1 1 1 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 1
##  [704] 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2
##  [741] 2 1 2 1 2 1 2 2 2 2 1 2 1 2 2 2 1 2 1 2 1 2 2 2 1 2 2 1 2 1 1 2 2 2 2 2 2
##  [778] 1 2 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1
##  [815] 1 2 1 1 2 1 2 2 2 1 2 2 1 1 1 2 2 1 2 2 1 2 2 2 2 1 2 2 2 1 2 2 1 2 1 2 2
##  [852] 2 2 1 2 2 1 1 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 1 2 2 1 1
##  [889] 2 1 2 2 2 2 1 2 2 2 2 2 1 2 2 1 2 1 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 1
##  [926] 1 1 2 1 2 1 2 2 1 2 2 2 2 1 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 2 1 2 2 1
##  [963] 2 2 2 1 2 2 2 2 1 1 2 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2
## [1000] 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 1 2 2 2 2 1 1 2
## [1037] 2 1 2 1 1 1 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1 1 1 2 2 1 2 2 2 1 2 1 2 2 1 1 1
## [1074] 1 2 1 2 1 1 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 1
## [1111] 2 2 1 2 1 1 2 2 2 2 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 2 1 2 1 1 2 2 2 2 2
## [1148] 2 1 1 1 1 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 1 1 1 1 2 1 2 2 2 1 1 2 2
## [1185] 1 2 1 1 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2
## [1222] 1 2 1 1 1 2 2 1 1 1 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 2 2 1 1 1 2 2 1
## [1259] 1 2 1 2 2 2 2 2 1 1 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2
## [1296] 2 2 1 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1 1 2 2
## [1333] 2 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 1
## [1370] 2 2 2 2 2 2 1 2 2 2 1 1 2 2 1 2 1 1 2 2 2 2 1 2 1 2 2 2 1 2 1 2 1 2 2 2 2
## [1407] 2 2 2 2 2 2 2 2 1 1 2 2 2 1 1 1 2 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 2 1 2 1 2
## [1444] 2 2 2 1 2 2 2 2 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1481] 1 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 1
## [1518] 1 1 2 2 1 2 2 1 2 1 1 1 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 1
## [1555] 2 1 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 1 1 2 2 1 2 1 2 2 2
## [1592] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2 2 2 1
## [1629] 2 1 1 2 2 2 2 1 1 2 2 1 1 1 2 1 2 2 2 1 1 2 1 2 2 2 1 2 1 2 1 2 2 2 2 2 2
## [1666] 1 2 1 1 2 2 2 1 2 1 1 2 1 2 2 1 2 2 1 1 2 1 1 1 2 1 2 2 2 2 2 2 2 1 2 1 2
## [1703] 2 2 2 2 2 2 2 2 2 2 1 1 1 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 1
## [1740] 1 2 1 2 2 1 2 2 1 1 2 1 2 1 1 2 1 2 1 2 1 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2
## [1777] 2 2 2 1 2 1 1 1 2 2 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2 1 2 1 2 1 2 1 1 1 2 2 1
## [1814] 2 2 2 2 2 2 2 1 2 2 1 1 2 1 1 1 2 2 2 2 2 1 1 1 2 1 2 2 2 2 2 2 1 2 2 2 2
## [1851] 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 1
## [1888] 2 1 2 2 2 2 1 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2
## [1925] 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2
## [1962] 2 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2
## [1999] 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 1 2 2 2 2
## [2036] 1 2 1 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 2 1 2 1 1 2 2 2 2 1 2 1 2 2 2 1 2 2
## [2073] 1 1 2 2 2 1 2 2 2 1 2 2 2 2 2 1 1 2 1 2 1 2 1 1 2 1 2 2 2 2 1 2 1 1 1 2 2
## [2110] 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 1 1 2 2 1 1
## [2147] 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 2 1 1 2 2 1 2 1 1 2 2 2 1 2 2 2 1 2 2 1
## [2184] 2 2 1 1 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2
## [2221] 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 1 2 2 1 1
## [2258] 2 2 2 2 2 2 1 2 2 1 2 2 2 1 2 2 1 2 1 1 1 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1]  8839.964 13010.500
##  (between_SS / total_SS =  13.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
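
The reported between_SS / total_SS of 13.4 % can be reproduced from the numbers printed above. This sketch relies on one fact about `scale()`: each standardized column has variance 1, so the total sum of squares of the scaled data is (n - 1) times the number of variables.

```r
# Values taken from the output above
n <- 2294                            # rows in the sample
p <- 11                              # numeric variables that were scaled
withinss <- c(8839.964, 13010.500)   # within-cluster sums of squares

# Each scaled column has variance 1, so its sum of squares is n - 1
totss <- (n - 1) * p

# What the clusters explain is the total minus what remains inside them
betweenss <- totss - sum(withinss)
pct <- round(100 * betweenss / totss, 1)
print(pct)  # 13.4
```

A ratio this low means the two clusters account for only a small part of the variation in the scaled data.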
tracks_sample$cluster <- as.factor(km_tracks$cluster)
tracks_sample
## # A tibble: 2,294 x 16
##    genre artist_name popularity acousticness danceability duration_ms energy
##    <fct> <chr>            <dbl>        <dbl>        <dbl>       <dbl>  <dbl>
##  1 HipH… Talib Kweli         48     0.263           0.629      227373  0.787
##  2 Dance Andy Gramm…         55     0.0409          0.621      199113  0.827
##  3 Rap   Mac Miller          50     0.109           0.438      197852  0.792
##  4 Pop   Simple Plan         65     0.000491        0.522      232067  0.751
##  5 HipH… Pitbull             67     0.039           0.673      229507  0.758
##  6 Rap   Jason Deru…         82     0.0776          0.643      195419  0.904
##  7 Rock  Sundara Ka…         67     0.00118         0.528      226395  0.858
##  8 Pop   Bazzi               62     0.154           0.699      148230  0.668
##  9 Rock  Soda Stereo         52     0.00487         0.542      212893  0.714
## 10 HipH… 6LACK               66     0.73            0.425      286957  0.406
## # … with 2,284 more rows, and 9 more variables: instrumentalness <dbl>,
## #   key <fct>, liveness <dbl>, loudness <dbl>, mode <fct>, speechiness <dbl>,
## #   tempo <dbl>, valence <dbl>, cluster <fct>
tracks_sample %>% ggplot(aes(x = duration_ms, y = popularity, color = cluster)) +
  geom_point() +
  theme_minimal()

fviz_cluster(object = km_tracks, data = tracks_scale) + 
  theme_minimal()

non_numeric <- which(sapply(tracks_sample, negate(is.numeric)))

tracks_pca <- PCA(tracks_sample,
                  scale.unit = T,
                  quali.sup = non_numeric,
                  graph = F,
                  ncp = 15)

summary(tracks_pca)
## 
## Call:
## PCA(X = tracks_sample, scale.unit = T, ncp = 15, quali.sup = non_numeric,  
##      graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               2.309   1.505   1.142   1.027   0.985   0.948   0.847
## % of var.             20.987  13.682  10.379   9.332   8.955   8.619   7.696
## Cumulative % of var.  20.987  34.668  45.047  54.379  63.334  71.954  79.650
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.833   0.645   0.543   0.217
## % of var.              7.575   5.867   4.937   1.971
## Cumulative % of var.  87.225  93.092  98.029 100.000
## 
## Individuals (the 10 first)
##                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                |  3.250 |  1.012  0.019  0.097 |  0.942  0.026  0.084 |
## 2                |  2.008 |  0.805  0.012  0.161 | -0.047  0.000  0.001 |
## 3                |  3.315 |  0.852  0.014  0.066 | -0.663  0.013  0.040 |
## 4                |  1.841 |  1.040  0.020  0.319 | -0.978  0.028  0.282 |
## 5                |  2.029 |  1.579  0.047  0.606 | -0.136  0.001  0.004 |
## 6                |  3.075 |  1.616  0.049  0.276 | -0.091  0.000  0.001 |
## 7                |  2.559 |  1.031  0.020  0.162 | -1.282  0.048  0.251 |
## 8                |  2.441 |  0.762  0.011  0.097 | -0.048  0.000  0.000 |
## 9                |  3.198 |  0.304  0.002  0.009 | -1.388  0.056  0.188 |
## 10               |  4.496 | -2.847  0.153  0.401 | -1.205  0.042  0.072 |
##                   Dim.3    ctr   cos2  
## 1                 2.302  0.202  0.502 |
## 2                -0.555  0.012  0.076 |
## 3                 1.927  0.142  0.338 |
## 4                -0.786  0.024  0.182 |
## 5                 0.188  0.001  0.009 |
## 6                -1.807  0.125  0.345 |
## 7                -0.480  0.009  0.035 |
## 8                -0.049  0.000  0.000 |
## 9                 0.976  0.036  0.093 |
## 10                0.025  0.000  0.000 |
## 
## Variables (the 10 first)
##                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## popularity       |  0.002  0.000  0.000 |  0.095  0.596  0.009 | -0.617 33.386
## acousticness     | -0.647 18.115  0.418 |  0.158  1.653  0.025 | -0.003  0.001
## danceability     |  0.125  0.673  0.016 |  0.769 39.243  0.591 | -0.018  0.030
## duration_ms      | -0.149  0.966  0.022 | -0.471 14.759  0.222 |  0.122  1.306
## energy           |  0.872 32.917  0.760 | -0.267  4.720  0.071 |  0.015  0.021
## instrumentalness | -0.310  4.171  0.096 | -0.405 10.926  0.164 |  0.156  2.131
## liveness         |  0.237  2.434  0.056 | -0.099  0.654  0.010 |  0.595 31.035
## loudness         |  0.818 28.979  0.669 | -0.112  0.831  0.013 | -0.149  1.951
## speechiness      |  0.083  0.298  0.007 |  0.482 15.415  0.232 |  0.578 29.288
## tempo            |  0.159  1.093  0.025 | -0.254  4.297  0.065 |  0.084  0.612
##                    cos2  
## popularity        0.381 |
## acousticness      0.000 |
## danceability      0.000 |
## duration_ms       0.015 |
## energy            0.000 |
## instrumentalness  0.024 |
## liveness          0.354 |
## loudness          0.022 |
## speechiness       0.334 |
## tempo             0.007 |
## 
## Supplementary categories (the 10 first)
##                       Dist     Dim.1    cos2  v.test     Dim.2    cos2  v.test
## Dance            |   0.678 |   0.273   0.162   3.998 |  -0.339   0.249  -6.138
## HipHop           |   0.843 |   0.024   0.001   0.382 |   0.592   0.492  11.692
## Pop              |   0.735 |  -0.082   0.013  -1.334 |   0.025   0.001   0.491
## Rap              |   0.678 |   0.104   0.024   1.675 |   0.457   0.456   9.100
## Rock             |   1.084 |  -0.283   0.068  -4.512 |  -0.786   0.526 -15.517
## *NSYNC           |   3.488 |   2.905   0.694   1.912 |   0.196   0.003   0.159
## $uicideBoy$      |   2.143 |  -0.348   0.026  -0.648 |   0.681   0.101   1.572
## 03 Greedo        |   2.493 |   1.291   0.268   0.850 |   0.623   0.062   0.508
## 070 Shake        |   2.524 |  -1.786   0.501  -1.176 |   0.331   0.017   0.269
## 10cc             |   5.564 |  -3.790   0.464  -2.494 |  -1.795   0.104  -1.463
##                      Dim.3    cos2  v.test  
## Dance            |   0.015   0.000   0.313 |
## HipHop           |   0.500   0.351  11.333 |
## Pop              |  -0.534   0.528 -12.296 |
## Rap              |   0.259   0.146   5.912 |
## Rock             |  -0.228   0.044  -5.167 |
## *NSYNC           |   0.245   0.005   0.229 |
## $uicideBoy$      |   0.337   0.025   0.892 |
## 03 Greedo        |  -0.436   0.031  -0.408 |
## 070 Shake        |   0.424   0.028   0.397 |
## 10cc             |   0.606   0.012   0.567 |
fviz_eig(tracks_pca, ncp = 11, addlabels = T, main = "Variance Explained by Dimensions")
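
The eigenvalues printed in the summary above can be used to decide how many dimensions to keep. The sketch below applies two common rules of thumb; the 80% target is an assumed threshold, not one stated in this analysis.

```r
# Eigenvalues copied from the summary(tracks_pca) output above
eig <- c(2.309, 1.505, 1.142, 1.027, 0.985, 0.948, 0.847, 0.833, 0.645, 0.543, 0.217)

# Cumulative share of variance explained
cum_var <- cumsum(eig) / sum(eig)
print(round(100 * cum_var, 1))

# Smallest number of dimensions that reaches the assumed 80% target
n_80 <- which(cum_var >= 0.80)[1]
print(n_80)      # 8

# Kaiser criterion: keep dimensions with an eigenvalue above 1
n_kaiser <- sum(eig > 1)
print(n_kaiser)  # 4
```

The two rules give different answers here, so the choice depends on how much information loss is acceptable for the later steps.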

non_numeric
##       genre artist_name         key        mode     cluster 
##           1           2           9          12          16
fviz_pca_ind(tracks_pca, habillage = 1)

fviz_pca_var(tracks_pca) +
  theme_minimal()

fviz_cluster(object = km_tracks, data = tracks_scale) + 
  theme_minimal()